gep: GEP-3440 - Gateway API Support for gRPC Retries #3441
Welcome @shadialtarsha!
Hi @shadialtarsha. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@mikemorris already set me up for success here with his HTTP retries GEP, so this GEP is heavily inspired by his work. I assume this GEP might depend on the outcome of #3219, or at least the conformance tests will need the timeout in order to verify that the backoff does not exceed it. By the way, this is my first PR, so I hope I followed the process correctly 😅
Co-authored-by: Sotiris Nanopoulos <[email protected]>
Co-authored-by: Seth Epps <[email protected]>
- No standard APIs for advanced retry logic, such as integrating with rate-limiting headers.
- No default retry policies for all routes within a namespace or for routes tied to a specific Gateway.
- No support for detailed backoff adjustments, like fine-tuning intervals, adding jitter, or setting max duration caps.
- No retry support for streaming or bidirectional APIs (may be considered in future proposals).
How is this enforced in the API specification?
Thanks for calling that out. The API doesn't have a way to enforce this non-goal.
I am thinking of three ways to do that:
- Adding a restriction in the API documentation clarifying that retries apply only to unary calls, with a potential future option to expand to streaming. Something along the lines of:
// Note: **Retries are supported only for unary gRPC calls.**
// Implementations MUST NOT apply retries to streaming or bidirectional
// gRPC calls, as these types of calls are stateful and retrying them
// could result in data loss or duplication.
- Explicit Field: Add a UnaryOnly field (e.g., UnaryOnly bool) that makes it clear retries are restricted to unary calls.
- Remove this restriction and let users choose whether to apply retries on any gRPC call type.
Would like to hear your thoughts on this.
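For discussion, here is a minimal Go sketch of how the second option could behave. It is only an illustration under stated assumptions: `RetryPolicy`, `CallType`, and `ShouldRetry` are hypothetical names that do not exist in Gateway API, and `UnaryOnly` is the field floated above, not an agreed design.

```go
package main

import "fmt"

// CallType is a hypothetical classification of a gRPC method.
// It is not part of Gateway API.
type CallType int

const (
	Unary CallType = iota
	ClientStreaming
	ServerStreaming
	BidiStreaming
)

// RetryPolicy is a hypothetical stand-in for the proposed retry config.
type RetryPolicy struct {
	Attempts  int  // maximum number of retries after the first try
	UnaryOnly bool // the explicit field suggested in option two
}

// ShouldRetry reports whether one more retry is allowed, given how many
// retries have already been made for this call.
func ShouldRetry(p RetryPolicy, call CallType, retriesSoFar int) bool {
	// Streaming calls are stateful, so a unary-only policy never retries them.
	if p.UnaryOnly && call != Unary {
		return false
	}
	return retriesSoFar < p.Attempts
}

func main() {
	p := RetryPolicy{Attempts: 2, UnaryOnly: true}
	fmt.Println(ShouldRetry(p, Unary, 0))         // true
	fmt.Println(ShouldRetry(p, BidiStreaming, 0)) // false
	fmt.Println(ShouldRetry(p, Unary, 2))         // false: retries exhausted
}
```

With `UnaryOnly` left as `false`, the same function covers the third option, where users decide for themselves whether streaming calls are retried.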
@shadialtarsha Thanks for this PR! We're currently in our release scoping phase for v1.3. To have this considered for scope in v1.3, please propose it in #3403.
I left some comments but overall I think this is a good thing to do – thanks for starting in on this!! 🙂
4. **Non-Idempotent Requests** (`non_idempotent`):
By default, Nginx does not retry non-idempotent requests (like POST or PUT) because they can cause side effects
if sent multiple times. However, you can enable retries for non-idempotent requests if needed.
Does this imply that you MUST do something special to get NGINX to retry gRPC at all?
2. **Retry Limits**: Traefik provides configurable retry attempts and can set a maximum number of retries. However,
Traefik does not offer per-try timeout controls specific to each retry attempt. Instead, it typically relies on a
global request timeout, limiting the flexibility needed for more precise gRPC retry management (like Envoy’s `per_try_timeout`).
Linkerd supports gRPC retry as well: you MUST configure a GRPCRoute for Linkerd to understand that gRPC semantics are desired, but after that you can configure retries either on Routes or Services. See https://linkerd.io/2.17/reference/retries/.
gRPC retries with specialized logic, while other proxies rely on HTTP error codes, lacking the precision needed
for gRPC.

### Go
I would really like to always see new API stuff described in mostly-English rather than in Go. I think you're saying this:
We're going to add a `retry` stanza to the GRPCRoute `rule`:
retry:
reasons: an array of gRPC status code names
attempts: an optional maximum number of retries, implementation-specific default
backoff: minimum time between retries as a GEP-2257 Duration, implementation-specific default
All of these are Extended.
I feel like we should always be able to describe new additions like this -- if we really can't easily describe the API in English, we're probably not designing it well in the first place. 🙂
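For comparison with the English description, a minimal Go sketch of the same stanza might look like the following. This is not the type proposed in the GEP: `GRPCRetry`, its methods, and the default value are hypothetical, and `time.Duration` is used as a simplification of the GEP-2257 Duration string.

```go
package main

import (
	"fmt"
	"time"
)

// GRPCRetry is a hypothetical Go shape for the `retry` stanza.
type GRPCRetry struct {
	Reasons  []string       // gRPC status code names, e.g. "UNAVAILABLE"
	Attempts *int           // optional; nil means implementation-specific default
	Backoff  *time.Duration // optional; nil means implementation-specific default
}

// assumedDefaultAttempts is an illustrative default only. The description
// leaves the real default up to each implementation.
const assumedDefaultAttempts = 1

// Retryable reports whether the given status code name is a listed reason.
func (r GRPCRetry) Retryable(status string) bool {
	for _, reason := range r.Reasons {
		if reason == status {
			return true
		}
	}
	return false
}

// EffectiveAttempts returns the configured maximum number of retries,
// falling back to the assumed default when the field is unset.
func (r GRPCRetry) EffectiveAttempts() int {
	if r.Attempts == nil {
		return assumedDefaultAttempts
	}
	return *r.Attempts
}

func main() {
	backoff := 200 * time.Millisecond
	r := GRPCRetry{
		Reasons: []string{"UNAVAILABLE", "RESOURCE_EXHAUSTED"},
		Backoff: &backoff,
	}
	fmt.Println(r.Retryable("UNAVAILABLE")) // true
	fmt.Println(r.Retryable("INTERNAL"))    // false
	fmt.Println(r.EffectiveAttempts())      // 1
}
```

Using pointers for `Attempts` and `Backoff` keeps "unset" distinguishable from an explicit zero, which matters when the default is implementation-specific.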
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: shadialtarsha. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind gep
What this PR does / why we need it:
Proposes configurations for gRPC retries within `GRPCRoute`.
Which issue(s) this PR fixes:
Fixes #3440
Does this PR introduce a user-facing change?: